1. Introduction

Machine learning applied to the real estate industry is relevant, as it allows real estate companies to set selling prices or rents efficiently. Real estate agents only need to visit the property and gather the necessary information (e.g., square meters, number of rooms, bathrooms, etc.), which is then translated into data. Thereafter, a machine learning model provides rapid and efficient estimates. Rather than relying solely on the agent's prediction, machine learning brings a more scientific, data-driven approach, providing a benchmark that the agent can then adjust based on his/her experience and knowledge.

In this project, several supervised and unsupervised machine learning models are used to predict house and apartment rents in Brazil. What makes this project original is not the prediction task itself but the country where the data were collected: such data sets commonly come from the United States and are rarely from Latin America.

This project aims to analyze the observations in order to better understand the data, through investigation of the variables in the data set and processing of the different variable types (e.g., dummy, factor, numeric, categorical, etc.). Furthermore, we explore a wide range of models, discuss and compare them, and finally select the best model for this data set.

2. Data

We found our dataset on Kaggle, a platform hosting an online data science community. There were two versions of the dataset; we picked the second one as it was the most complete and recent. This dataset contains 10'692 properties to rent:

Dataset Source Type Observations Features Description
Brazilian houses to rent v2 https://www.kaggle.com CSV 10’692 13 city, area [m2], rooms, bathroom, parking spaces, floor, animals, furniture, in BRL (R$): HOA, rent amount, property tax, fire insurance, total

For the moment, we are keeping total, but we will remove it when we fit the models, as the value we want to predict (rent) is contained in total.

2.1 Summary

No Variable Stats / Values, Freqs (% of Valid) Missing
1 city [factor]: Belo Horizonte 1258 (11.8%), Campinas 853 (8.0%), Porto Alegre 1193 (11.2%), Rio de Janeiro 1501 (14.0%), São Paulo 5887 (55.1%) 0 (0.0%)
2 area [integer]: Mean (sd): 149 (537); min < med < max: 11 < 90 < 46335; IQR (CV): 126 (3.6); 517 distinct values 0 (0.0%)
3 rooms [integer]: Mean (sd): 2.5 (1.2); min < med < max: 1 < 2 < 13; IQR (CV): 1 (0.5); 11 distinct values 0 (0.0%)
4 bathroom [integer]: Mean (sd): 2.2 (1.4); min < med < max: 1 < 2 < 10; IQR (CV): 2 (0.6); freqs: 1: 4301 (40.2%), 2: 2910 (27.2%), 3: 1474 (13.8%), 4: 1111 (10.4%), 5: 578 (5.4%), 6: 215 (2.0%), 7: 85 (0.8%), 8: 11 (0.1%), 9: 4 (0.0%), 10: 3 (0.0%) 0 (0.0%)
5 parking.spaces [integer]: Mean (sd): 1.6 (1.6); min < med < max: 0 < 1 < 12; IQR (CV): 2 (1); 11 distinct values 0 (0.0%)
6 floor [factor, 35 levels, top 10 shown]: "-" 2461 (23.0%), 1 1081 (10.1%), 10 357 (3.3%), 11 303 (2.8%), 12 257 (2.4%), 13 200 (1.9%), 14 170 (1.6%), 15 147 (1.4%), 16 109 (1.0%), 17 96 (0.9%), 25 others 5511 (51.5%) 0 (0.0%)
7 animal [factor]: acept 8316 (77.8%), not acept 2376 (22.2%) 0 (0.0%)
8 furniture [factor]: furnished 2606 (24.4%), not furnished 8086 (75.6%) 0 (0.0%)
9 hoa..R.. [integer]: Mean (sd): 1174 (15592); min < med < max: 0 < 560 < 1117000; IQR (CV): 1068 (13.3); 1679 distinct values 0 (0.0%)
10 rent.amount..R.. [integer]: Mean (sd): 3896 (3408); min < med < max: 450 < 2661 < 45000; IQR (CV): 3470 (0.9); 1195 distinct values 0 (0.0%)
11 property.tax..R.. [integer]: Mean (sd): 367 (3108); min < med < max: 0 < 125 < 313700; IQR (CV): 337 (8.5); 1243 distinct values 0 (0.0%)
12 fire.insurance..R.. [integer]: Mean (sd): 53.3 (47.8); min < med < max: 3 < 36 < 677; IQR (CV): 47 (0.9); 216 distinct values 0 (0.0%)
13 total..R.. [integer]: Mean (sd): 5490 (16485); min < med < max: 499 < 3582 < 1120000; IQR (CV): 4706 (3); 5751 distinct values 0 (0.0%)

2.2 Wrangling/cleaning

2.2.1 Modifying variables

Looking at the summary and head of our dataset, we observed that we needed to modify some variables. We renamed some of them for a correct display of their names. According to the author of the dataset, the value "-" in the floor feature corresponds to houses at ground level. Therefore, we changed it to 0.

city area rooms bathroom parking floor animal furniture hoa rent property tax fire insurance total
Sao Paulo 70 2 1 1 7 yes yes 2065 3300 211 42 5618
Sao Paulo 320 4 4 0 20 yes no 1200 4960 1750 63 7973
Porto Alegre 80 1 1 1 6 yes no 1000 2800 0 41 3841
Porto Alegre 51 2 1 0 2 yes no 270 1112 22 17 1421
Sao Paulo 25 1 1 0 1 no no 0 800 25 11 836
Sao Paulo 376 3 3 7 0 yes no 0 8000 834 121 8955
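As an illustration, this recoding step can be sketched as follows. This is a Python/pandas sketch on a tiny made-up frame (the report's actual processing was done in R, and the raw column names here are only indicative):

```python
import pandas as pd

# Tiny synthetic stand-in for the Kaggle data (illustrative raw column names).
df = pd.DataFrame({
    "city": ["Sao Paulo", "Porto Alegre"],
    "floor": ["7", "-"],                      # "-" marks a ground-level house
    "hoa (R$)": [2065, 270],
    "rent amount (R$)": [3300, 1112],
})

# Rename awkward column names for a cleaner display.
df = df.rename(columns={"hoa (R$)": "hoa", "rent amount (R$)": "rent"})

# Per the dataset author, "-" in `floor` means ground level -> recode to 0.
df["floor"] = df["floor"].replace("-", "0").astype(int)
```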

2.2.2 Missing Data

As seen in the last column of the summary, we don’t have any missing values.

2.2.3 Duplicates

We have several duplicate instances, of which we only need to keep the first occurrence. By removing the duplicates, we go from 10'692 to 10'334 observations.

city area rooms bathroom parking floor animal furniture hoa rent property tax fire insurance total
Belo Horizonte 15 1 1 0 3 no yes 0 1100 0 15 1115
Belo Horizonte 15 1 1 0 3 no yes 0 1100 0 15 1115
Belo Horizonte 15 1 1 0 3 no yes 0 1100 0 15 1115
Belo Horizonte 20 1 1 0 6 no no 420 1050 79 14 1563
Belo Horizonte 20 1 1 0 6 no no 420 1050 79 14 1563
Belo Horizonte 20 1 1 1 0 yes yes 0 1100 0 15 1115
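Deduplication amounts to keeping the first occurrence of each repeated row. A minimal Python/pandas sketch on made-up data (the project itself uses R):

```python
import pandas as pd

# Synthetic frame with exact duplicate rows, mimicking the situation above.
df = pd.DataFrame({
    "city": ["Belo Horizonte"] * 5,
    "area": [15, 15, 15, 20, 20],
    "rent": [1100, 1100, 1100, 1050, 1050],
})

# keep="first" retains the first occurrence of each duplicated row.
df_clean = df.drop_duplicates(keep="first").reset_index(drop=True)
```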

2.2.4 Distribution of data and outliers

Looking at some of the features in the summary, we see a high possibility of outliers in area, hoa and property_tax. We can quickly check the distribution of all the features with boxplots.

We apply a log transformation to some of the features to better see their distribution. But before applying the transformation, we should look into some visible outliers. As we had guessed, area, hoa and property_tax have some extreme values; floor and rent also seem to have outliers. By looking at the top 5 values of each of these features, we can find some discrepancies:

Highest Area (m2)
city area floor rent hoa property tax
Belo Horizonte 46335 11 8500 960 646
Sao Paulo 24606 12 8100 2254 7859
Campinas 12732 3 1600 700 96
Belo Horizonte 2000 0 4956 0 200
Belo Horizonte 2000 2 5000 3000 200
Highest Hoa (R$)
city area floor rent hoa property tax
Belo Horizonte 155 4 2790 1117000 64
Sao Paulo 340 7 12000 220000 1000
Sao Paulo 285 6 20000 200000 1834
Rio de Janeiro 35 1 4500 81150 9900
Porto Alegre 42 10 700 32000 40
Highest Property Tax (R$)
city area floor rent hoa property tax
Sao Paulo 42 6 2500 690 313700
Rio de Janeiro 95 11 3000 1024 28120
Sao Paulo 700 0 10000 0 21880
Belo Horizonte 260 10 11900 3200 12500
Sao Paulo 300 6 1990 3500 10830
Highest Rent (R$)
city area floor rent hoa property tax
Sao Paulo 700 0 45000 0 8750
Sao Paulo 350 0 30000 0 560
Sao Paulo 486 0 25000 0 2200
Sao Paulo 80 1 24000 875 0
Sao Paulo 660 12 20000 4800 1750
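Tables like the ones above can be produced by sorting each suspect feature and keeping the top rows. A hedged Python/pandas sketch with a few invented rows:

```python
import pandas as pd

# Made-up rows echoing the kind of extreme values discussed in the text.
df = pd.DataFrame({
    "city": ["Belo Horizonte", "Sao Paulo", "Campinas", "Porto Alegre"],
    "area": [46335, 24606, 12732, 80],
    "hoa": [960, 2254, 700, 1000],
})

# Top rows by area, analogous to the "Highest Area" table above.
top_area = df.nlargest(3, "area")
```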

Area: We observe 3 potential outliers. It is improbable to have a property of 46'335 m2 on the 11th floor, or houses of 24'606 m2 and 12'732 m2 on the 12th and 3rd floors. We will remove these values from our data set.

HOA: The first five values are probably outliers, as the HOA is far too high compared to the rent. Therefore, we will remove these five observations. We also checked the values beyond the first five, but we did not find anything abnormal.

Property Tax: It seems unlikely that a 42 m2 property pays 313'700 BRL of property tax. This is why we will remove this observation.

Rent: The 3 highest rents seem very high, but knowing that they are houses (floor = 0), they are still plausible. This is why we will keep these observations. However, the 4th highest rent in our data set looks a bit suspicious: 24'000 BRL for an 80 m2 property on the 1st floor, knowing that the average monthly rent is about 5000 BRL in the city centre 1. This is why we will remove this observation.

Highest Floor by City
city floor
Belo Horizonte 301
Campinas 51
Rio de Janeiro 25
Porto Alegre 23
Sao Paulo 46

Floor: By checking the tallest buildings in each city 2 and comparing them with the highest floors in our data set, we see that we need to remove two outliers, as they report floors higher than actually exist.


After removing these outliers, we have 10'323 observations left. We can now log transform some features in our dataset.

We observe for hoa that the data distribution in São Paulo and Belo Horizonte is very spread out. This is because some properties have no homeowners association tax.

For the floor feature, we have decided to modify it into intervals:

Floor Count Floor Count
0 2371 6 528
1 998 7 487
2 952 8 486
3 894 9-12 1260
4 726 13-18 783
5 580 19+ 257
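The binning of floor into these intervals can be sketched with pd.cut. This is a Python sketch whose cut points mirror the table above; the report's actual R code may differ:

```python
import numpy as np
import pandas as pd

floors = pd.Series([0, 1, 5, 8, 10, 15, 25])

# Floors 0-8 kept as-is, then the grouped intervals 9-12, 13-18 and 19+.
bins = [-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 12, 18, np.inf]
labels = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9-12", "13-18", "19+"]
floor_cat = pd.cut(floors, bins=bins, labels=labels)
```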

2.3 Correlation

We see a very high correlation of 0.99 between fire_insurance and the variable we would like to predict, rent. We will need to remove this feature, as fire_insurance is probably calculated from the rent. There is also, as expected, a high correlation with total. As we said previously, we will need to remove this variable when fitting the models, but we will keep it for our EDA.
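The correlation check itself is a one-liner. Below, a Python sketch on invented numbers chosen so that fire_insurance tracks rent almost linearly, as in the real data:

```python
import pandas as pd

# Synthetic data where fire_insurance is roughly proportional to rent,
# mirroring the near-perfect correlation found in the report.
df = pd.DataFrame({
    "rent": [3300, 1112, 2800, 8000, 4960],
    "fire_insurance": [42, 17, 41, 121, 63],
})

# Pearson correlation between the predictor and the target.
corr = df["rent"].corr(df["fire_insurance"])
```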

3. Exploratory data analysis

3.1 Categorical Data

In our dataset, we have more properties accepting animals, and more properties that are not furnished. We can also observe that we have many more observations in São Paulo than in the other Brazilian cities. This might be explained by the fact that São Paulo is the most populated city in Brazil.

We also observe that if a property accepts animals, it tends to have a higher rent. The same holds for the furniture feature: if the property is already furnished, the rent will be higher. We can also see that rent is highest in São Paulo.

3.2 Numerical data

We observe that houses (floor = 0 often corresponds to this situation) have higher rents. Rent also increases when living on higher floors and, as we could have expected, with the number of parking spaces.

Globally, a change in the area seems to have a positive impact on the rent price. Again, we see that we have more data in São Paulo. Plotting rent against hoa and property_tax produces plots similar to the one below, so we are not displaying them.

3.3 Numerical & Categorical data

It is interesting to see that São Paulo and Belo Horizonte have the highest rents. However, Rio de Janeiro's total fees are higher than Belo Horizonte's. Indeed, the homeowners association tax in Rio is twice as high as the one in Belo Horizonte. The city with the highest rent in proportion to the total fees is Belo Horizonte: on average, rent represents about 80.7% of total fees.

For our EDA, we kept the total variable, but to predict the rent, we need to remove it to avoid overfitting (as rent is included in total). As we have seen in the correlation part, it is not necessary to keep fire_insurance, as it is highly correlated with the value we would like to predict. Indeed, it is computed from the rent variable. Therefore, we will remove it.

4. Modelling

In the modelling part of our project, we will start by applying supervised learning methods and comparing them. Lastly, two methods of unsupervised learning will be implemented and discussed.

4.1 Supervised learning

The supervised learning methods chosen are the following:
- Linear regression
- Regression tree
- K-nearest neighbors (KNN)

We first split our data at random into a training set (80%) and a test set (20%) to be able to assess the performance of our models. In addition, we also created a cross-validation scheme (k-fold = 10) that will be used to train the different models.
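In scheme form, this setup looks as follows. This is a Python/scikit-learn sketch with placeholder data; the report uses R's caret, but the split proportions and fold count are the same:

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # placeholder features
y = rng.normal(size=100)        # placeholder target (rent)

# 80/20 train/test split, as in the report.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# 10-fold cross-validation scheme used during training.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
n_folds = cv.get_n_splits(X_tr)
```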

4.1.1 Linear regression

We will start by creating a full linear regression model with all our variables, except fire_insurance and total. After that, we will proceed to a forward variable selection from the null linear model, based on the AIC criterion.
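Forward selection by AIC greedily adds, at each step, the variable that most lowers the AIC, and stops when no remaining candidate improves it. A self-contained Python sketch of this idea (the report uses R's stepwise machinery; here the Gaussian AIC is computed up to an additive constant, and the variable names and toy data are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def aic(X, y):
    """Gaussian AIC up to an additive constant: n*log(RSS/n) + 2*(k+1)."""
    n, k = X.shape
    rss = np.sum((y - LinearRegression().fit(X, y).predict(X)) ** 2)
    return n * np.log(rss / n) + 2 * (k + 1)

def forward_select(X, y, names):
    selected, remaining = [], list(range(X.shape[1]))
    best_aic = np.inf
    while remaining:
        score, j = min((aic(X[:, selected + [j]], y), j) for j in remaining)
        if score >= best_aic:       # no candidate improves the AIC -> stop
            break
        best_aic = score
        selected.append(j)
        remaining.remove(j)
    return [names[j] for j in selected]

# Toy data: y depends only on the first two columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)
chosen = forward_select(X, y, ["area", "bathroom", "animal", "noise"])
```

The two informative variables are picked first, in order of strength.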

Full model
\(lm(formula = rent ~ city + area + rooms + bathroom + parking + floor + animal + furniture + hoa + propertytax, data = df.tr)\)

Adjusted R-squared is quite low with a value of 0.585.

Selected model
\(lm(formula = rent ~ bathroom + area + hoa + furniture + city + propertytax + floor + parking + rooms, data = df.tr)\)
Here are the results of the full and the selected model:
  Full model Model Selection
Coefficients Estimates SE t-value p Estimates SE t-value p
(Intercept) 282.07 128.55 2.19 0.028 254.50 124.40 2.05 0.041
city [Campinas] -525.20 112.48 -4.67 <0.001 -530.59 112.30 -4.72 <0.001
city [Porto Alegre] -119.69 103.55 -1.16 0.248 -128.62 103.01 -1.25 0.212
city [Rio de Janeiro] 479.84 101.15 4.74 <0.001 474.06 100.92 4.70 <0.001
city [Sao Paulo] 574.43 82.37 6.97 <0.001 571.43 82.30 6.94 <0.001
area 7.46 0.33 22.51 <0.001 7.46 0.33 22.50 <0.001
rooms 110.49 33.35 3.31 0.001 107.80 33.20 3.25 0.001
bathroom 516.86 31.91 16.20 <0.001 517.08 31.91 16.20 <0.001
parking 175.46 23.04 7.61 <0.001 174.30 23.00 7.58 <0.001
floor [1] -559.30 104.85 -5.33 <0.001 -559.44 104.85 -5.34 <0.001
floor [13-18] 140.08 115.17 1.22 0.224 138.80 115.16 1.21 0.228
floor [19+] 713.70 169.47 4.21 <0.001 713.83 169.47 4.21 <0.001
floor [2] -599.68 108.09 -5.55 <0.001 -602.74 108.03 -5.58 <0.001
floor [3] -509.21 109.50 -4.65 <0.001 -512.67 109.42 -4.69 <0.001
floor [4] -460.51 116.44 -3.96 <0.001 -463.42 116.38 -3.98 <0.001
floor [5] -408.85 126.06 -3.24 0.001 -411.79 126.01 -3.27 0.001
floor [6] -302.19 130.05 -2.32 0.020 -303.51 130.04 -2.33 0.020
floor [7] -430.32 136.49 -3.15 0.002 -433.72 136.43 -3.18 0.001
floor [8] -465.20 134.32 -3.46 0.001 -467.68 134.28 -3.48 <0.001
floor [9-12] -242.26 101.93 -2.38 0.017 -245.32 101.87 -2.41 0.016
animal [yes] -51.22 60.19 -0.85 0.395
furniture [yes] 1120.39 58.05 19.30 <0.001 1124.49 57.85 19.44 <0.001
hoa 0.43 0.03 14.33 <0.001 0.44 0.03 14.37 <0.001
property_tax 0.47 0.04 10.78 <0.001 0.47 0.04 10.78 <0.001
Observations 8260 8260
R2 / R2 adjusted 0.586 / 0.585 0.586 / 0.585


We end up with a model without animal. The adjusted R-squared is still low and remains at 58.5%. It means that, based on the AIC, all variables except animal are informative.
We can see first of all that rent decreases significantly for the cities of Campinas and Porto Alegre. Rents will be cheaper on average, especially in Campinas (-525.20 BRL, with Belo Horizonte as the reference point), everything else being equal. Rents in Rio de Janeiro and São Paulo will be on average significantly higher, with São Paulo having on average the highest rents (+574.43 BRL), again, everything else being equal.
We can see that, logically, area, rooms, parking and especially bathroom are positively associated with an increase in rent, everything else being equal. This is not something we can tell from the linear regression, but we saw in our EDA that all these explanatory variables are positively correlated.
On average, there is a decrease in rent depending on the floor. From the first floor to the twelfth, the ground level (floor 0) being the reference point, the rent is associated with a lower value. Nevertheless, from floors thirteen to eighteen, rent starts to be positively impacted, and the impact is stronger from the nineteenth floor upwards (the ground level still being the reference level). It is surprising to see that floor zero is on average much more expensive than the first twelve floors, everything else being equal; indeed, rents tend to be higher as floors increase. As mentioned earlier, floor zero can correspond to houses, which have higher rents than most flats. A furnished residence will have a considerably higher rent (+1120.39 BRL) than an unfurnished residence.

We decided to train our model with cross-validation in order to prevent overfitting. We can see, in the following table, that the R-squared oscillates between 48.3% and 64.2% with 10 folds.

#>    RMSE Rsquared  MAE Resample
#> 1  2259    0.573 1409   Fold01
#> 2  2403    0.566 1444   Fold02
#> 3  2093    0.598 1368   Fold03
#> 4  2018    0.642 1330   Fold04
#> 5  2241    0.549 1435   Fold05
#> 6  2205    0.592 1405   Fold06
#> 7  2520    0.483 1571   Fold07
#> 8  1993    0.628 1325   Fold08
#> 9  2193    0.617 1480   Fold09
#> 10 2081    0.569 1320   Fold10


Next, we plot our predicted values against the observations.

We can observe that our linear model makes smaller errors when predicting rents up to roughly 5'000 BRL than when predicting rents above this limit. The graph above shows that beyond 5'000 BRL the observations become more spread out and difficult to predict.

4.1.2 Regression tree

We start by plotting an unpruned regression tree. We can observe that the variables used to predict rent are area, bathroom, property_tax and hoa.

We can see that the model does not have great accuracy, especially when the observed rents are higher.

#> CART 
#> 
#> 8260 samples
#>   10 predictor
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 7434, 7434, 7434, 7435, 7434, 7435, ... 
#> Resampling results across tuning parameters:
#> 
#>   cp      RMSE  Rsquared  MAE 
#>   0.0494  2458  0.478     1700
#>   0.0656  2561  0.433     1793
#>   0.3950  3035  0.380     2209
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was cp = 0.0494.

The caret train function, used with cross-validation, has chosen a pruned model with a best-tuned cp of 0.049.
Therefore, here we plot the pruned tree based on the optimal cp. We can see that the chosen tree now has only three terminal nodes.

#> Call:
#> rpart(formula = rent ~ ., data = df.tr)
#>   n= 8260 
#> 
#>       CP nsplit rel error xerror   xstd
#> 1 0.3950      0     1.000  1.000 0.0292
#> 2 0.0656      1     0.605  0.609 0.0212
#> 3 0.0494      2     0.539  0.554 0.0209
#> 
#> Variable importance
#>         area property_tax     bathroom      parking        rooms 
#>           27           17           17           15           12 
#>          hoa         city 
#>           10            1 
#> 
#> Node number 1: 8260 observations,    complexity param=0.395
#>   mean=3.95e+03, MSE=1.16e+07 
#>   left son=2 (5666 obs) right son=3 (2594 obs)
#>   Primary splits:
#>       area         < 152  to the left,  improve=0.395, (0 missing)
#>       property_tax < 364  to the left,  improve=0.379, (0 missing)
#>       bathroom     < 2.5  to the left,  improve=0.359, (0 missing)
#>       parking      < 2.5  to the left,  improve=0.282, (0 missing)
#>       rooms        < 3.5  to the left,  improve=0.231, (0 missing)
#>   Surrogate splits:
#>       bathroom     < 2.5  to the left,  agree=0.867, adj=0.575, (0 split)
#>       parking      < 2.5  to the left,  agree=0.850, adj=0.524, (0 split)
#>       property_tax < 404  to the left,  agree=0.840, adj=0.490, (0 split)
#>       rooms        < 3.5  to the left,  agree=0.830, adj=0.460, (0 split)
#>       hoa          < 1690 to the left,  agree=0.789, adj=0.327, (0 split)
#> 
#> Node number 2: 5666 observations
#>   mean=2.5e+03, MSE=3.44e+06 
#> 
#> Node number 3: 2594 observations,    complexity param=0.0656
#>   mean=7.11e+03, MSE=1.48e+07 
#>   left son=6 (1385 obs) right son=7 (1209 obs)
#>   Primary splits:
#>       property_tax < 608  to the left,  improve=0.1630, (0 missing)
#>       area         < 300  to the left,  improve=0.1270, (0 missing)
#>       bathroom     < 3.5  to the left,  improve=0.0899, (0 missing)
#>       hoa          < 2790 to the left,  improve=0.0535, (0 missing)
#>       parking      < 2.5  to the left,  improve=0.0508, (0 missing)
#>   Surrogate splits:
#>       hoa      < 2040 to the left,  agree=0.706, adj=0.369, (0 split)
#>       bathroom < 3.5  to the left,  agree=0.667, adj=0.285, (0 split)
#>       parking  < 2.5  to the left,  agree=0.650, adj=0.248, (0 split)
#>       area     < 300  to the left,  agree=0.625, adj=0.194, (0 split)
#>       city     splits as  LLLLR,    agree=0.623, adj=0.190, (0 split)
#> 
#> Node number 6: 1385 observations
#>   mean=5.65e+03, MSE=9.71e+06 
#> 
#> Node number 7: 1209 observations
#>   mean=8.77e+03, MSE=1.54e+07

In this summary we see that, for node 1, if area is not available the model will use bathroom to predict the rent. For node 3, if property_tax is not available the model will use hoa.

Variable importance
Overall
property_tax 0.543
area 0.522
bathroom 0.449
parking 0.333
rooms 0.231
hoa 0.054
cityCampinas 0.000
cityPorto Alegre 0.000
cityRio de Janeiro 0.000
citySao Paulo 0.000
floor1 0.000
floor13-18 0.000
floor19+ 0.000
floor2 0.000
floor3 0.000
floor4 0.000
floor5 0.000
floor6 0.000
floor7 0.000
floor8 0.000
floor9-12 0.000
animalyes 0.000
furnitureyes 0.000


property_tax and area are the two most important variables, which is why the three-leaf tree uses these features. bathroom is the third most important, but it is rather used as a surrogate split, probably because it is highly correlated with area (0.75).

When we look at the predictions versus the observed values, the plot looks almost like the unpruned one, meaning pruning the tree simplifies the model without greatly affecting the errors.
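Cost-complexity pruning, which rpart's cp and caret's tuning implement, can be mimicked in Python with scikit-learn's ccp_alpha parameter. A sketch on synthetic data (the alpha value here is illustrative, not the report's cp):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data standing in for the rent data: two strong step effects
# plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 4.0 * (X[:, 0] > 5) + 2.0 * (X[:, 1] > 5) + rng.normal(scale=0.1, size=300)

# Unpruned tree, then a pruned tree via cost-complexity pruning
# (ccp_alpha plays the role of rpart's cp).
full = DecisionTreeRegressor(random_state=0).fit(X, y)
pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=0.05).fit(X, y)
```

Pruning collapses the many noise-driven splits while keeping the strong ones, drastically reducing the number of leaves.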

4.1.3 KNN

The third and last supervised learning model we fitted is K-Nearest Neighbors (KNN). We let the model tune the optimal parameter k (the number of neighbors whose rents are averaged to predict an observation).

#> k-Nearest Neighbors 
#> 
#> 8260 samples
#>   10 predictor
#> 
#> Pre-processing: centered (23), scaled (23) 
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 7434, 7434, 7434, 7435, 7434, 7435, ... 
#> Resampling results across tuning parameters:
#> 
#>   k   RMSE  Rsquared  MAE 
#>    5  2307  0.546     1445
#>    7  2265  0.561     1423
#>    9  2242  0.569     1411
#>   11  2242  0.570     1410
#>   13  2243  0.570     1411
#>   15  2240  0.572     1409
#>   17  2241  0.572     1411
#>   19  2242  0.573     1411
#>   21  2237  0.575     1409
#>   23  2238  0.576     1410
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 21.

The model selected k = 21, based on the RMSE. A high value of k provides a robust model; however, the goodness of fit is not excellent: the R-squared is lower than 0.6.
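The caret tuning above (center/scale preprocessing, 10-fold CV, a grid of odd k values) has a close Python analogue. A sketch with placeholder data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # placeholder features
y = X[:, 0] + 0.1 * rng.normal(size=200)

# Centered/scaled features; k chosen by 10-fold CV over a grid, as caret does.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
grid = GridSearchCV(
    pipe,
    {"kneighborsregressor__n_neighbors": [5, 7, 9, 11, 13, 15, 17, 19, 21, 23]},
    cv=10,
    scoring="neg_root_mean_squared_error",
).fit(X, y)
best_k = grid.best_params_["kneighborsregressor__n_neighbors"]
```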

Finally, we plot the predictions of this model:

We can observe that the prediction is better for smaller rent values.

4.1.4 Scoring and model selection

Models R2 RMSE MAE
KNN 0.596 2271 1436
Linear Regression 0.595 2244 1435
Regression Tree 0.499 2500 1773

Comparing the scores of our three models, we see that linear regression and KNN are equivalent in terms of R2, but linear regression has a lower RMSE than the two other models. The regression tree is clearly the worst. However, even the best model does not perform very well on our data set: as we saw previously, its adjusted R2 is 58.5%, which is not great. It is important to highlight that this last R2 is different from the one used in the model comparison.
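The three scores in the table are standard regression metrics. A Python sketch of how they are computed, on invented predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented observed/predicted pairs (not the report's actual test set).
y_obs = np.array([1390.0, 45000.0, 3300.0, 1112.0])
y_hat = np.array([2159.0, 15577.0, 3000.0, 1500.0])

r2 = r2_score(y_obs, y_hat)
rmse = np.sqrt(mean_squared_error(y_obs, y_hat))
mae = mean_absolute_error(y_obs, y_hat)
```

RMSE penalizes large errors more heavily than MAE, which is why the two can rank models differently.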

4.1.5 Applying our models on observations

To illustrate our three models, we decided to apply them on two observations: the highest rent value and one value close to the median.

Highest Rent
city area rooms bathroom parking floor animal furniture hoa rent property tax
2156 Sao Paulo 700 4 7 8 0 yes no 0 45000 8750
Median Rent
city area rooms bathroom parking floor animal furniture hoa rent property tax
5170 Porto Alegre 62 2 2 2 6 yes no 570 1390 60

Linear Model

Unfortunately, as we saw previously on the linear regression graph, our model makes considerable errors in predicting rents above ~5'000 BRL. The model predicts a rent of ~15'577 BRL instead of 45'000 BRL, about a third of what it should be.
For the median one, the linear regression is closer to the observed value of 1'390 BRL but still makes large errors: it predicts a rent of ~2'159 BRL, an overestimate of about 770 BRL.

Predicted Observed
Highest Rent 15’577 45’000
Median Rent 2’159 1’390

Pruned Regression Tree

For the first instance, as the area of the property is 700 m2 and the property_tax is 8'750 BRL, the regression tree sends the observation to the right twice. The predicted rent is 8'771 BRL, while the true rent is 45'000 BRL.
For the second instance, it predicts 2'501 BRL because we go left at the first node, the area of this property being lower than 153 m2. The prediction error is quite large, as the observed rent of 1'390 BRL is almost half the prediction.

We can see that this model already makes large errors for a typical value of the data set, and it does an even worse job with the largest observation. This explains why the regression tree was the worst model in the previous comparison.

KNN

With KNN, the prediction of the maximum observation is far from reality: we predict 13'690 BRL instead of 45'000 BRL, around a third of the real value.
KNN is more accurate with the median observation; it predicts 1'945 BRL while the actual value is 1'390 BRL.

KNN Predicted Observed
Highest Rent 13’690 45’000
Median Rent 1’945 1’390

4.2 Unsupervised learning

In this part, we will apply two unsupervised learning methods:
- PAM, for clustering
- PCA, for dimension reduction

4.2.1 Clustering

As our data contains both categorical and numerical variables, the other clustering methods seen in class were not adapted to our needs. Hierarchical clustering methods can be strongly affected by extreme observations, which is called the hiding effect; as we have seen, although we deleted meaningless outliers, we still have many extreme observations. In our case, the PAM method, thanks to its use of medoids, is more robust and, unlike the K-means method, can be used with numerical and categorical features.

To start, we scaled our data to make sure that the clustering is not influenced by differences in variable scales. Indeed, area, hoa, rent and property tax can take very high values, while variables like the number of rooms or bathrooms have maximums of 13 and 10.
Table 4.2.1.1 : Scaled dataset used for unsupervised learning methods
city area rooms bathroom parking floor animal furniture hoa rent property tax
Sao Paulo -0.550 -0.462 -0.897 -0.403 0.364 yes yes 0.995 -0.193 -0.195
Sao Paulo 1.297 1.250 1.225 -1.029 2.814 yes no 0.235 0.291 2.042
Porto Alegre -0.476 -1.319 -0.897 -0.403 0.176 yes no 0.060 -0.339 -0.501

To determine the number of clusters, we used the fviz_nbclust function. As the function only takes a numeric matrix or a data frame as input, we turned the categorical variables into dummies. Since we had three categorical columns (one with five levels and the other two with two levels each), we went from eleven variables to fourteen (deleting the original columns).

Table 4.2.1.2 : Dataset with dummies used for fviz_nbclust()
area rooms bathroom parking floor hoa rent property tax city Campinas city Porto Alegre city Rio de Janeiro city Sao Paulo animal yes furniture yes
-0.550 -0.462 -0.897 -0.403 0.364 0.995 -0.193 -0.195 0 0 0 1 1 1
1.297 1.250 1.225 -1.029 2.814 0.235 0.291 2.042 0 0 0 1 1 0
-0.476 -1.319 -0.897 -0.403 0.176 0.060 -0.339 -0.501 0 1 0 0 1 0

Then, we chose the silhouette statistic in order to maximize the goodness of fit of the clustering. As displayed on the graph below, we found that k = 2 is the number of clusters maximizing the average silhouette. We will not display the results here, but we also checked the within-cluster sum of squares [WSS] method, and the elbow minimizing it was also at k = 2.
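The silhouette criterion for choosing k can be sketched as follows. This is a Python sketch in which KMeans stands in for PAM on purely numeric toy data (PAM is not in scikit-learn's core); the criterion itself, average silhouette maximized over k, is the same:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated blobs, so the silhouette criterion should pick k = 2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2))])

# Average silhouette for each candidate number of clusters.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
```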

So, we used the pam function specifying k=2, following our previous results. Here are the two medoids:

Table 4.2.1.3 : PAM medoids with k=2
cluster id city area rooms bathroom parking floor animal furniture hoa rent property tax
1 8797 Rio de Janeiro -0.439 -0.462 -0.897 -0.403 -0.201 yes no -0.423 -0.660 -0.356
2 4749 Sao Paulo 0.411 0.394 1.225 0.848 0.553 yes no 0.674 0.828 0.468

We can see that the medoid of cluster one seems to be a smaller property than the medoid of cluster two. The following plots confirm this. Cluster one is characterized by properties that could be called standard, whereas cluster two includes houses that are bigger and exceptional. As a matter of fact, we see more outlying values in cluster two, both in terms of house characteristics and financial charges. The graphical representation of the categorical variables by cluster does not give us much more information, but our hypothesis of "exceptional real estate" in cluster two is once again confirmed, as there are fewer such properties. We can highlight that São Paulo is the city where the difference between the number of houses in each cluster is smallest. This can be explained by the fact that São Paulo has been a dynamic financial center in Brazil for many years.

Finally, we can have a look at the silhouette plot. We have 6'946 observations in cluster one and the 3'376 remaining in cluster two. The silhouette statistic of cluster one is well above that of cluster two, with respectively 0.43 and 0.08. We can see that all the observations of cluster one are correctly clustered, while cluster two has a bit less than half of its observations misclustered. The average silhouette is 0.32, which is not really impressive. Nevertheless, we should not over-interpret it, as we do not have a point of comparison.
Silhouette Plot

4.2.2 Dimension reduction


The PCA method is not really appropriate for categorical variables, since it is a rotation of the data from one coordinate system to another; numerical data are preferable for this method. We first used our data with the categorical variables turned into dummies, but in total this added six dummies over fourteen variables. There is not a lot of variation in dummies, and adding six of them would not be suitable for a method that aims to capture as much variance as possible in as few components as possible. To overcome this, we performed PCA exclusively on the numerical variables of our data set.

Reaching 75-95% of the variance is a usual threshold when selecting dimensions. Here, we decided to select dimensions one to three, in order to have a cumulative variance of 78.1% (see the eigenvalues by clicking on the plots). We could also add dimension four to reach 84.33% of the variance; the PCA would still achieve a good dimension reduction by cutting the number of variables in half. We did not take dimension four for report space reasons, as the three dimensions already capture enough variation to illustrate the method.
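The cumulative-variance criterion can be sketched as follows. This is a Python/scikit-learn sketch on synthetic correlated data; the 75% cut-off matches the lower bound of the usual threshold mentioned above:

```python
import numpy as np
from sklearn.decomposition import PCA

# Correlated synthetic features: two latent factors drive six columns,
# plus two low-variance noise columns, so variance concentrates in few PCs.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
X = np.hstack([latent @ rng.normal(size=(2, 6)),
               rng.normal(scale=0.3, size=(300, 2))])

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching 75% of the variance.
n_keep = int(np.searchsorted(cumvar, 0.75) + 1)
```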

In the following plot, we can see that almost all the variables are positively correlated with dimension one. The exceptions are:
- floor, which is strongly correlated with dimension two and hardly at all with the first one
- hoa, which is correlated with both of the first two dimensions
We can still observe this on the second plot. property_tax is negatively correlated with dimension three; the other variables are not, or only slightly, correlated with the latter.

In this plot we can observe the contribution of each variable to the different dimensions. It is interesting to see that our previous statements are reflected here as well. The only exception is the contribution of the variable floor to the third dimension, up to 20%. We can add that every variable contributes to at least one dimension above the threshold of ~12%; in fact, we know the variation is conserved in Principal Component Analysis.
To conclude our unsupervised analysis, we decided to plot the observations on the selected dimensions and to color them by cluster. In the first graph, the clusters are quite clearly separated along dimension 2. Our previous statements are confirmed here again: the properties from cluster one are agglomerated, whereas the houses of cluster two are more dispersed. The outliers, which could be considered exceptional properties, are all in cluster two.

5. Conclusion

An important lesson to be drawn from this project is that the quality of a data set influences the predictive quality of supervised learning models. Although a wide range of models was tested on this specific data set, the predictive quality is only average.

This limitation is due to the presence of outliers in the data set and to the varying relevance of some more or less correlated variables. There were also questions about the different levels of the city variable: as these are megacities, rent probably differs across neighboring areas, and taking this into account would have improved precision. Furthermore, the following variables could have provided interesting information for rent prediction: age of the property, type of property, crime rate, and distance to the city center.

In the unsupervised learning section, the two methods were of limited help in understanding the data. Nevertheless, the PCA method turned out to be less appropriate than PAM: the transformation of categorical variables into dummies is inconvenient when they are numerous, and considering exclusively the numerical variables loses a lot of variance explanation.

In conclusion, the recommended model is the supervised linear regression model. It is the model with the smallest errors [RMSE & MAE] and the most robust accuracy. It is therefore not surprising that this type of model is widely used by estate agencies.